Computer and Modernization ›› 2011, Vol. 193 ›› Issue (9): 1-4. doi: 10.3969/j.issn.1006-2475.2011.09.001

• Algorithm Design and Analysis •

Focused Crawler Based on Improved Algorithm of Web Content Similarity

WEI Jing-jing (1), YANG Ding-da (2), LIAO Xiang-wen (2)

  1. Department of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China; 2. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
  • Received: 2011-05-06 Online: 2011-09-22 Published: 2011-09-22

Abstract: A focused crawler is an important component of a vertical search engine. The Web content relevance algorithm used by traditional focused crawlers considers only term frequency and ignores where the key terms appear in the page. After analyzing focused crawlers based on Web content relevance, this paper proposes an improved method of calculating relevance that uses the features of HTML tags. Experimental results show that the improved algorithm achieves an average accuracy of 64.99%, an increase of 15.37% over the original method.

Key words: search engine, focused crawler, similarity, vector space model, HTML tags
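
The idea summarized in the abstract, weighting a term by the HTML tag it appears in and then comparing the page with the topic in a vector space model, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the tag weights in TAG_WEIGHTS, the tokenizer, and the function names weighted_vector, cosine, and relevance are all assumptions made for illustration, and the abstract does not state which tags or weights the paper uses.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, chosen for illustration only (not from the paper).
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}                      # text here is not page content
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never closed, so never pushed


class WeightedTermCounter(HTMLParser):
    """Accumulates term counts, each occurrence scaled by its enclosing tag's weight."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, outermost first
        self.weights = Counter()  # term -> tag-weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        # Use the heaviest enclosing tag as the weight of this text run.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def weighted_vector(html):
    """Return the tag-weighted term vector of an HTML page."""
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weights


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def relevance(html, topic_terms):
    """Relevance of a page to a topic given as a list of key terms."""
    topic = Counter(t.lower() for t in topic_terms)
    return cosine(weighted_vector(html), topic)
```

Under these assumed weights, a page whose title contains the topic terms scores higher than a page that mentions them only in body text, which a pure term-frequency measure would not distinguish. A crawler could then compare relevance(page_html, ["focused", "crawler"]) against a threshold to decide whether to follow the page's links.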
